Monster Media 1996 #15

Text File | 1996-04-13 | 18KB | 351 lines

HTMSTRIP.DOC                                          Revised: 04/13/96

The HTMSTRIP.EXE program attempts to read HTML pages, remove the HTML coding, and write the file out as something more useful.

Features of this program:

* Can be run across an entire subdirectory (for example, your entire cache subdirectory), and will only process the HTML documents that it finds. (There are some options on this.)
* Removes all embedded HTML commands.
* Recodes the standard HTML "entity references" (e.g. "&copy;" becomes "(c)"). The actual replacements are coded in a user-modifiable lookup file.
* Handles standard indent, heading, selection groups, menus, tables, etc.
* Reflows all text as appropriate.
* Optionally, will replace Link, Image, and Input references with user-definable text representations.
* Optionally, alerts you to possible errors in the HTML code itself.

HTML codes are surrounded within <...> indicators. For upward compatibility reasons, Web browsers ignore any codes that they don't understand and just process the ones they can handle. Note that the HTMSTRIP command is currently geared for handling HTML 2.0 files and the Netscape table-specific extensions (added to HTML 3.0).

HTMSTRIP removes all HTML codes. It also handles the standard HTML "&xxx;" "entity references" (e.g. "&copy;" is replaced by "(c)"). You can add or change these replacements as desired by using the INI file (documented later).

HTMSTRIP is also tuned to allow it to specially handle several embedded HTML codes.
These codes are the following:

<A ...>                         External link
<BLOCKQUOTE>...</BLOCKQUOTE>    Indented block of text
<BR>                            Forced line break
<CAPTION>...</CAPTION>          Title for a table
<CENTER>...</CENTER>            Centering text
<DD>                            Term definition
<DIR>...</DIR>                  Directory list of items
</DL>                           End of definition list
<DT>                            First term of definition list/glossary
<H1> to <H6>...</H1> to </H6>   Heading items
<HR>                            Horizontal rule
<IMG ...>                       Image
<INPUT ...>                     User input
<LI>                            Menu/Ordered/Unordered/Directory list item
<MENU>...</MENU>                Menu listing
<OL>...</OL>                    Ordered listing
<OPTION>                        Used for single/multiple choice menus
<P>                             Paragraph indicator
<PRE>...</PRE>                  Preserve spacing block (preformatted text)
<SELECT>...</SELECT>            Block for single/multiple choice menu
<TABLE>...</TABLE>              Table block
<TD>...</TD>                    Table data (cell)
<TH>...</TH>                    Table heading
<TITLE>...</TITLE>              Title item
<TR>...</TR>                    Table row
<UL>...</UL>                    Unordered listing

If you run across other codes that become vital, let me know and I'll try to handle them somehow.

How to get HTML files:

Some people who are using regular Web browsers like Mosaic or Netscape don't realize that they're automatically saving HTML files to their hard disk throughout every Web session. That's because just about every Web browser saves the most-recently accessed files from the Web (including HTML source code, GIF's, and JPG's) on your hard disk and reads them from there instead of requiring you to download them every time you go back to a previous page. This is typically settable by you under "Preferences" and "Cache" on your Web browser.

I usually set my Web browser to have a huge cache, maybe 10MB. Anything beats downloading the same pages over again even at 28.8K. And I make sure that I do not have anything specified like "clear cache at the end of every session". Then I just go through the files in the cache subdirectory afterward and reprocess them.

Two disadvantages to a cache...
It takes up hard disk space but, hey, the Web browser is typically in Windows so why are you surprised. The second disadvantage is that if the page actually changes between sessions, you typically won't notice the new page as long as the old one remains in your cache. If you think a page should have changed but you are still seeing the cached copy, you can typically ask your Web browser to reload the page. On some browsers, this is shown as an arrow in the form of a circle.

HTMSTRIP can process the entire cache subdirectory. It automatically detects non-HTML files for you and processes accordingly. It creates new text file versions of just the HTML pages it finds.

By the way, the current beta version of Netscape typically ignores my cache setting. I don't have the slightest idea why. As a result, when you Alt-F4 out of Netscape, it goes through and deletes all but a few of the temporary files. This is annoying to say the least, and it means I have to run HTMSTRIP from a DOS window just before leaving Netscape. If anyone knows why it does this to me, please let me know!

Specifying parameters:

Parameters for this program can be set in the following ways. The last setting encountered always wins:

- Read from an *.INI file (see BRUCEINI.DOC file),
- Through the use of an environmental variable (SET HTMSTRIP=whatever), or
- From the command line (see "Syntax" below)

Defining entity references:

HTMSTRIP will process an entity reference definition file if one is found. This table can be in your standard *.INI file (e.g. HTMSTRIP.INI) if desired or it can be a separate file specified using the /Linitfile parameter.

Entity references are how non-standard characters like the copyright character are handled in HTML pages. Entity references are indicated as "&xxx;" where "xxx" is either a code or a number preceded by a pound sign. The copyright symbol is indicated in HTML as "&copy;".
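As a rough illustration of the "&xxx;" form described above, here is a minimal Python sketch that finds entity references in a string, tells the named form from the numeric form, and swaps in a replacement when one is known. It is not HTMSTRIP's own code, which is not shown in this document. The names LOOKUP, ENTITY, classify, and replace_entities are hypothetical, and the two-entry table is only a stand-in for a real lookup file.

```python
import re

# Hypothetical stand-in for a lookup table; HTMSTRIP reads its real
# replacements from an INI file, which this sketch does not parse.
LOOKUP = {"&copy;": "(c)", "&#169;": "(c)"}

# "&xxx;" where xxx is either a name or a pound sign followed by digits.
ENTITY = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def classify(entity):
    """Return 'numeric' for the &#nnn; form and 'named' for the &name; form."""
    return "numeric" if entity.startswith("&#") else "named"


def replace_entities(text, lookup=LOOKUP):
    """Replace each entity found in the lookup; leave unknown ones untouched."""
    return ENTITY.sub(lambda m: lookup.get(m.group(0), m.group(0)), text)


if __name__ == "__main__":
    print(replace_entities("Copyright &copy; 1996"))
    print(classify("&#169;"))
```

Leaving unknown references untouched is a choice made for this sketch only; a real tool might instead drop them or warn about them.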
A default HTMSTRIP.INI is provided with over 230 entity reference lookups.

To define or change these lookups, the INI file should include a series of lines in the following format:

    &xxx; = outstr

where "&xxx;" is the HTML sequence and "outstr" is what you want to replace it with. The "outstr" portion can consist of regular non-space ASCII text characters (like "A" or "z") as well as hexadecimal values (in the form &Hxx) or decimal values (in the form \nnn). (See the BRUCEHEX.DOC file.) It can also be the word "NULL" which translates the string into nothing. You cannot use a space or equal sign in "outstr"; use the hexadecimal or decimal representations instead.

The table does not have to be in any specified order. Lines can end with "/*" followed by a comment if you want.

Examples:

    &copy;   = (c)     /* Copyright symbol
    &deg;    = °
    &eacute; = é
    &ecirc;  = ê
    &egrave; = è
    &nbsp;   = \032

Remember that "&xxx;" entity references (yes, I hate that phrase) are case-sensitive in HTML. "&deg;" will not find "&Deg;".

There seems to be a trend of late to relax some of the replacement coding requirements in Web pages. The ";" is now, apparently, becoming optional. Numeric replacements (the "&#nnn;" form) seem to no longer require the leading pound sign. Therefore, HTMSTRIP looks for both of these variations for any appropriate lookup. "&copy;" will find "&copy" and "&#153;" will find "&153". The lookup itself has to be entered in the formally correct way though.

You are also allowed to redefine the strings that are used for three symbolic references in the file. These show up only if /SYMBOLS is specified. By default, you will see the following:

    for <A> external links      -> (link)
    for <IMG> image references  -> (image)
    for <INPUT> user inputs     -> [Input]

You can redefine any and all of these symbolic references in the same lookup file.
These substitutions are specified more or less like the previous substitutions:

    <A>     = (link)
    <IMG>   = (image)
    <INPUT> = [Input]

Unlike with the other lookups, the left side is not case sensitive so "<a>=(link)" works just fine. Hexadecimal and decimal replacements are again acceptable (see BRUCEHEX.DOC file).

You might, for example, want to redefine some of them like this:

    <A>     = \251    /* Replaces with a √ symbol
    <IMG>   = \015    /* Replaces with a ☼ symbol (little flash cube)
    <INPUT> = ?       /* Replaces with a question mark

Any symbolic references that you do not redefine will default to their original values.

If /-SYMBOLS is specified, any symbolic definitions are ignored and a "NULL" replacement string is used for all of them.

Syntax:

    HTMSTRIP { filespec | @listfile } [ outfile ] [ /EXT=.xxx ] [ /WIDTH=n ]
             [ /SYMBOLS | /-SYMBOLS ] [ /ALL ] [ /SITE | /FSITE | /-SITE ]
             [ /ALT ] [ /SPACES | /-SPACES ] [ /WARNINGS | /-WARNINGS ]
             [ /RULE=s ] [ /BORDER=c ] [ /BUFF=n ] [ /Iinitfile | /-I ]
             [ /Linitfile ] [ /? ] [ /?&H ]

where:

"filespec" tells the routine which file or files are to be processed. The specification can include path and wildcards if desired. Typically, the file names are *.HTM files.

"@listfile" allows you to have a variety of file specifications saved in a text file named "listfile". Each line in the file should consist of one file specification, each of which can include a path and wildcards if desired. Blank lines and lines beginning with semi-colons, colons, or quotes are ignored.

"outfile" is the name of the output file to create. It is overwritten if it exists already. If no output file name is provided, the routine will use the input file name with an extension of .OUT. (The default .OUT extension can be overridden using the /EXT=.xxx parameter.) An outfile cannot be specified if wildcards or @listfile are used for the input file specification.

"/EXT=.xxx" allows you to specify a different default file extension for the output file.
This parameter only matters if you do not explicitly specify an output file name. Initially defaults to "/EXT=.OUT".

"/WIDTH=n" specifies the desired line length for wrapping long lines and also for centering. Initially defaults to "/WIDTH=80".

"/SYMBOLS" says to allow (unless redefined in your INI file) the "(link)", "(image)", and "[Input]" indicators. Initially defaults to "/-SYMBOLS".

"/-SYMBOLS" skips the indicators even if they're defined in your INI file. This is initially the default.

"/ALL" says that if the program encounters what it thinks is just a text file, it should take the file and try to fix up CR/LF problems (Unix files end with LF's instead of CR/LF which is what DOS needs) and that's it. This can be somewhat risky since it may misdiagnose a file but it should be safe if you're running it on your cache subdirectory. Initially defaults to "/-ALL" which means it won't process it unless it thinks it's an HTML file.

"/SITE" shows the name of any <A HREF=...> location in the output file. For example, if a link goes to a specific Web page, the output file may include some reference like [http://www.thex-files.com/upepis.htm/]. Initially defaults to "/-SITE" (do not show the site name).

"/FSITE" is similar to /SITE except all of the references are shown as footnotes instead of being left in the text itself. Initially defaults to "/-SITE".

"/-SITE" shows, at best, the symbolic reference if a link is provided on a page. Instead of some [http://...] thing, you'll see (link) provided that /SYMBOLS are turned on. Initially defaults to "/-SITE".

"/ALT" turns on the printing of the "Alt=" indicator in an <IMG...> statement. These are sometimes created by the page designer for use on buttons for viewers who don't have graphical support. Since text-only Web browsers are dying out, this is probably a standard which won't continue forever but it can't hurt.
If /ALT is specified, these alternate texts show up independently of the /SYMBOLS setting. Initially defaults to "/-ALT".

"/-ALT" prevents the Alt= text in <IMG...> statements from showing up. This is initially the default.

"/SPACES" keeps extra vertical spacing between sections. There are frequently lots of extra blank lines that appear in the output file either due to specific HTML requests or to ensure proper reformatting. Specifying /SPACES allows these to stay there.

"/-SPACES" removes these extra blank lines. This is initially the default.

"/WARNINGS" displays warnings when HTMSTRIP finds either internal problems in the document or things it can't handle. Initially defaults to "/-WARNINGS".

"/-WARNINGS" turns off the warning messages. This is initially the default.

"/RULE=s" specifies that a string is to be repeated the width of the line. This is used to separate sections. The string can be a single character (like "/RULE=-"), multiple characters (like "/RULE="- ""), it can contain decimal and hexadecimal characters (like "/RULE=\066\097\116"--see BRUCEHEX.DOC), it can be "/RULE=NULL" (which typically results in a blank line), or just simply "/RULE" (which is the same thing as "/RULE=-" if /BORDER=T and "/RULE=\196" if /BORDER=S or /BORDER=D). Personally, if your printer supports IBM graphics characters, I find "/RULE=\196" to be the most pleasing of the rule lines.

"/BORDER=c" specifies the type of border to use. The possible choices for "c" are "D" (double), "S" (single), "T" (text), "B" (blanks), or "N" (none). /BORDER=B shows spaces instead of delimiters whereas /BORDER=N skips the blank lines between cells entirely.
Examples of the other three:

          <T>ext                <S>ingle              <D>ouble
    +-----+-----+-----+   ┌─────┬─────┬─────┐   ╔═════╦═════╤═════╗
    |  1  |  2  |  3  |   │     │     │     │   ║     ║     │     ║
    +-----+-----+-----+   ├─────┼─────┼─────┤   ╠═════╬═════╪═════╣
    |  4  |  5  |  6  |   │     │     │     │   ║     ║     │     ║
    +-----+-----+-----+   ├─────┼─────┼─────┤   ╟─────╫─────┼─────╢
    |  7  |  8  |  9  |   │     │     │     │   ║     ║     │     ║
    +-----+-----+-----+   └─────┴─────┴─────┘   ╚═════╩═════╧═════╝

"/BUFF=n" specifies how many spaces to position on either side of the vertical bars in the tables. Defaults to /BUFF=1.

"/Iinitfile" says to read an initialization file with the file name "initfile". The file specification *must* contain a period. If no drive or path information is specified, the program will search for initfile beginning in your default subdirectory and then going throughout your DOS path. The use of an initialization file is optional. Initially defaults to "/IHTMSTRIP.INI".

"/-I" (or "/INULL") says to skip loading the initialization file.

"/Linitfile" says that the "&xxx;" and "<A>" etc lookup codes are found in a file other than from the default "/Iinitfile" file. This is primarily useful if you want to have a master *.INI file and a separate code lookup table.

"/?" or "/HELP" or "HELP" shows you the syntax for the command.

"/?&H" gives you a hexadecimal and decimal conversion table.

Author:

This program was written by Bruce Guthrie of Wayne Software. It is free for use and redistribution provided relevant documentation is kept with the program, no changes are made to the program or documentation, and it is not bundled with commercial programs or charged for separately. People who need to bundle it in for-sale packages must pay a $50 registration fee to "Wayne Software" at the following address.

Additional information about this and other Wayne Software programs can be found in the file BRUCEymm.DOC which should be included in the original ZIP file. ("ymm" is replaced by the last digit of the year and the two digit month of the release.
BRUCE508.DOC came out in August 1995. This same naming convention is used in naming the ZIP file that this program was included in.)

Comments and suggestions can also be sent to:

    Bruce Guthrie
    Wayne Software
    113 Sheffield St.
    Silver Spring, MD 20910

    fax: (301) 588-8986
    e-mail: bguthrie@nmaa.org
    http://hjs.geol.uib.no/guthrie/

See BRUCEymm.DOC file for additional contact information.

Foreign users: Please provide an Internet e-mail address in all correspondence.